1 Executive Summary

1.1 Overview

The purpose of this memo is to explore data on Airbnb listings in Barcelona and understand the key drivers of price, from the viewpoint of 2 people looking to visit the city for 4 nights.

1.2 Methodology

What is the best model to predict the price of 2 adults visiting Barcelona for 4 nights?

We started our project with Exploratory Data Analysis (EDA), which offered insights into the types, content, and relevance of the existing variables. We then progressed to our models, after checking for the correlation of our independent variables, in order to find an optimal regression model that will predict the price.

1.3 Findings

1.3.1 Exploratory Data Analysis

This section explores the steps we have taken to tidy and understand the data:

  • Data Wrangling, since not all variables were of the appropriate type
  • Rejection of variables that were incomplete or only had “NA” observations
  • Adjusted the levels of some variables by clustering and categorising them
  • Created new cleaned variables
  • Excluded NA observations in the variables we deemed relevant

After selecting and adjusting relevant variables, we created summary statistics, charts, tables, and correlation matrices to visualise the data. During that, we concluded some correlation between [insert findings]

1.3.2 Regression Analysis

This section showcases our work on examining relevant independent variables and constructing a number of different regression models to identify the optimal model, which should be the one which explains the highest portion of variance in the price of Airbnb listings in Barcelona.

1.3.2.1 Optimal Model

  • Predictor variables across the following categories: prop_type_simplified review_scores_rating room_type minimum_nights bathrooms bedrooms accommodates instant_bookable neighbourhood_simplified availability_30 reviews_per_month host_is_superhost host_response_rate host_acceptance_rate review_scores_cleanliness review_scores_checkin review_scores_location review_scores_value

  • Problematic collinearity: We include several variables from the same category in our price prediction model. There is no significant collinearity problem for our model, since, statistically, we have Square of GVIF^(1/(2*DF)) less than 5, which suggests there is no significant collinear relationships between our variables used for price prediction. However, there might be some further adjustments to out best model to improve the R-square for better price prediction, such as including more factors in some other different categories and also to make sure the collinearity of our variables become as small as possible.

  • The estimated stay for 2 adults spending four nights in Barcelona would cost is 835.0494, with 95% confidence interval of [512.9052,1613.538].

2 Loading the Data

3 Exploratory Data Analysis

In our EDA, we will attempt to answer to following questions:

  1. Which variables are relevant?
  2. Does the data need cleaning?
  3. Do we have outliers which we need to exclude?
  4. Are there variables we need to convert?
  5. Are there any additional variables we need to create based on the manipulation of existing data?
  6. What is the relationship between the relevant variables? (Through this, we want to examine how successfully independent variables explain dependent variables)
  7. What are the dependent and independent variables?review_scores_rating: Average review score (0 - 100)

3.1 Raw Values

We use glimpse() to have a look at the data. There are 16,206 observations across 74 variables and presumably we might not actually need all of them, because some might be overlapping and some irrelevant.

glimpse(listings)
Rows: 16,206
Columns: 74
$ id                                           <dbl> 18674, 23197, 32711, 3498~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021091e+13, 2.021091e+1~
$ last_scraped                                 <date> 2021-09-10, 2021-09-11, ~
$ name                                         <chr> "Huge flat for 8 people c~
$ description                                  <chr> "110m2 apartment to rent ~
$ neighborhood_overview                        <chr> "Apartment in Barcelona l~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 71615, 90417, 135703, 731~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Mireia And Maria", "Etai~
$ host_since                                   <date> 2010-01-19, 2010-03-09, ~
$ host_location                                <chr> "Barcelona, Catalonia, Sp~
$ host_about                                   <chr> "We are Mireia (43) & Mar~
$ host_response_time                           <chr> "within an hour", "within~
$ host_response_rate                           <chr> "100%", "100%", "100%", "~
$ host_acceptance_rate                         <chr> "83%", "70%", "100%", "83~
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, TRUE~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "la Sagrada Família", "El~
$ host_listings_count                          <dbl> 35, 2, 5, 3, 4, 4, 4, 4, ~
$ host_total_listings_count                    <dbl> 35, 2, 5, 3, 4, 4, 4, 4, ~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ neighbourhood                                <chr> "Barcelona, CT, Spain", "~
$ neighbourhood_cleansed                       <chr> "la Sagrada Família", "el~
$ neighbourhood_group_cleansed                 <chr> "Eixample", "Sant Martí",~
$ latitude                                     <dbl> 41.40556, 41.41291, 41.40~
$ longitude                                    <dbl> 2.17262, 2.22063, 2.17015~
$ property_type                                <chr> "Entire rental unit", "En~
$ room_type                                    <chr> "Entire home/apt", "Entir~
$ accommodates                                 <dbl> 8, 6, 6, 9, 2, 1, 1, 2, 1~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "2 baths", "2 baths", "1.~
$ bedrooms                                     <dbl> 3, 3, 2, 4, 1, 1, 1, 1, 1~
$ beds                                         <dbl> 6, 5, 3, 6, 1, 1, 1, 1, 1~
$ amenities                                    <chr> "[\"TV\", \"Wifi\", \"Hai~
$ price                                        <chr> "$121.00", "$229.00", "$1~
$ minimum_nights                               <dbl> 1, 4, 2, 4, 2, 2, 2, 2, 2~
$ maximum_nights                               <dbl> 1125, 300, 31, 365, 60, 6~
$ minimum_minimum_nights                       <dbl> 1, 4, 2, 3, 2, 2, 2, 2, 2~
$ maximum_minimum_nights                       <dbl> 3, 7, 2, 4, 2, 2, 2, 2, 2~
$ minimum_maximum_nights                       <dbl> 1125, 1125, 31, 365, 1125~
$ maximum_maximum_nights                       <dbl> 1125, 1125, 31, 365, 1125~
$ minimum_nights_avg_ntm                       <dbl> 1.6, 4.1, 2.0, 4.0, 2.0, ~
$ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 31, 365, 1125~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 1, 22, 0, 15, 0, 0, 0, 0,~
$ availability_60                              <dbl> 26, 52, 0, 36, 0, 0, 0, 0~
$ availability_90                              <dbl> 50, 82, 0, 66, 0, 0, 0, 0~
$ availability_365                             <dbl> 58, 128, 44, 156, 163, 16~
$ calendar_last_scraped                        <date> 2021-09-10, 2021-09-11, ~
$ number_of_reviews                            <dbl> 21, 52, 63, 154, 358, 309~
$ number_of_reviews_ltm                        <dbl> 0, 0, 0, 5, 45, 56, 39, 2~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 1~
$ first_review                                 <date> 2014-03-02, 2016-01-05, ~
$ last_review                                  <date> 2019-10-11, 2019-12-15, ~
$ review_scores_rating                         <dbl> 4.40, 4.74, 4.31, 4.52, 4~
$ review_scores_accuracy                       <dbl> 4.55, 4.92, 4.27, 4.66, 4~
$ review_scores_cleanliness                    <dbl> 4.75, 4.92, 4.38, 4.62, 4~
$ review_scores_checkin                        <dbl> 4.80, 4.94, 4.81, 4.65, 4~
$ review_scores_communication                  <dbl> 4.90, 4.98, 4.79, 4.67, 4~
$ review_scores_location                       <dbl> 4.75, 4.67, 4.81, 4.72, 4~
$ review_scores_value                          <dbl> 4.30, 4.69, 4.40, 4.48, 4~
$ license                                      <chr> "HUTB-002062", "HUTB-0050~
$ instant_bookable                             <lgl> TRUE, FALSE, TRUE, FALSE,~
$ calculated_host_listings_count               <dbl> 19, 2, 3, 2, 4, 4, 4, 4, ~
$ calculated_host_listings_count_entire_homes  <dbl> 19, 2, 3, 2, 0, 0, 0, 0, ~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 0, 0, 4, 4, 4, 4, 1~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 0.23, 0.75, 0.61, 1.54, 4~

We use skim() to identify missing values, and get a better picture of the data set. We make the following observations: - Some numeric variables are listed as character - There are a lot of missing values for character and numeric variables - Especially for the ratings, a lot of values are missing - There are many irrelevant variables

skim(listings)
Data summary
Name listings
Number of rows 16206
Number of columns 74
_______________________
Column type frequency:
character 25
Date 5
logical 7
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 16206 0
name 10 1.00 1 255 0 15662 0
description 118 0.99 1 1000 0 14728 0
neighborhood_overview 6556 0.60 1 1000 0 7139 0
picture_url 0 1.00 61 126 0 15925 0
host_url 0 1.00 38 43 0 7521 0
host_name 40 1.00 1 35 0 3469 0
host_location 52 1.00 2 166 0 609 0
host_about 6149 0.62 1 5756 0 3946 5
host_response_time 40 1.00 3 18 0 5 0
host_response_rate 40 1.00 2 4 0 68 0
host_acceptance_rate 40 1.00 2 4 0 93 0
host_thumbnail_url 40 1.00 55 106 0 7477 0
host_picture_url 40 1.00 57 109 0 7477 0
host_neighbourhood 4898 0.70 3 33 0 131 0
host_verifications 0 1.00 2 156 0 276 0
neighbourhood 6556 0.60 5 55 0 74 0
neighbourhood_cleansed 0 1.00 5 44 0 72 0
neighbourhood_group_cleansed 0 1.00 6 19 0 10 0
property_type 0 1.00 4 35 0 55 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 19 1.00 6 17 0 36 0
amenities 0 1.00 2 2173 0 13656 0
price 0 1.00 5 9 0 530 0
license 6007 0.63 1 243 0 4894 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-10 2021-09-11 2021-09-11 2
host_since 40 1.00 2008-09-19 2021-08-24 2016-03-02 3004
calendar_last_scraped 0 1.00 2021-09-10 2021-09-11 2021-09-11 2
first_review 4417 0.73 2011-01-04 2021-09-10 2018-10-07 2702
last_review 4417 0.73 2011-06-23 2021-09-10 2020-02-15 1760

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 40 1 0.16 FAL: 13562, TRU: 2604
host_has_profile_pic 40 1 1.00 TRU: 16107, FAL: 59
host_identity_verified 40 1 0.78 TRU: 12590, FAL: 3576
bathrooms 16206 0 NaN :
calendar_updated 16206 0 NaN :
has_availability 0 1 0.99 TRU: 16066, FAL: 140
instant_bookable 0 1 0.45 FAL: 8931, TRU: 7275

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.793740e+07 15720291.91 1.867400e+04 1.500257e+07 2.928164e+07 4.190451e+07 5.218653e+07 <U+2586><U+2585><U+2586><U+2587><U+2587>
scrape_id 0 1.00 2.021091e+13 0.00 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 <U+2581><U+2581><U+2587><U+2581><U+2581>
host_id 0 1.00 1.176105e+08 122466396.31 3.073000e+03 9.024729e+06 6.134564e+07 2.160151e+08 4.199209e+08 <U+2587><U+2582><U+2582><U+2582><U+2581>
host_listings_count 40 1.00 2.520000e+01 117.79 0.000000e+00 1.000000e+00 3.000000e+00 1.600000e+01 2.178000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
host_total_listings_count 40 1.00 2.520000e+01 117.79 0.000000e+00 1.000000e+00 3.000000e+00 1.600000e+01 2.178000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1.00 4.139000e+01 0.01 4.135000e+01 4.138000e+01 4.139000e+01 4.140000e+01 4.146000e+01 <U+2581><U+2587><U+2585><U+2581><U+2581>
longitude 0 1.00 2.170000e+00 0.02 2.090000e+00 2.160000e+00 2.170000e+00 2.180000e+00 2.230000e+00 <U+2581><U+2582><U+2587><U+2585><U+2581>
accommodates 0 1.00 3.400000e+00 2.24 0.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.600000e+01 <U+2587><U+2585><U+2581><U+2581><U+2581>
bedrooms 558 0.97 1.680000e+00 1.02 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 306 0.98 2.340000e+00 1.94 0.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 1.348000e+01 32.94 1.000000e+00 1.000000e+00 3.000000e+00 3.000000e+01 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 7.067200e+02 492.75 1.000000e+00 1.200000e+02 1.125000e+03 1.125000e+03 3.000000e+03 <U+2586><U+2587><U+2581><U+2581><U+2581>
minimum_minimum_nights 1 1.00 1.376000e+01 33.33 1.000000e+00 1.000000e+00 2.000000e+00 3.000000e+01 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_minimum_nights 1 1.00 1.589000e+01 35.24 1.000000e+00 2.000000e+00 3.000000e+00 3.000000e+01 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_maximum_nights 1 1.00 8.019400e+02 538.78 1.000000e+00 3.300000e+02 1.125000e+03 1.125000e+03 9.999000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_maximum_nights 1 1.00 2.658599e+05 23856502.75 1.000000e+00 3.600000e+02 1.125000e+03 1.125000e+03 2.147484e+09 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights_avg_ntm 1 1.00 1.473000e+01 34.14 1.000000e+00 2.000000e+00 3.000000e+00 3.000000e+01 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights_avg_ntm 1 1.00 2.652532e+05 23802528.85 1.000000e+00 3.450000e+02 1.125000e+03 1.125000e+03 2.142625e+09 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_30 0 1.00 9.230000e+00 11.07 0.000000e+00 0.000000e+00 3.000000e+00 1.800000e+01 3.000000e+01 <U+2587><U+2582><U+2581><U+2581><U+2582>
availability_60 0 1.00 2.273000e+01 23.23 0.000000e+00 0.000000e+00 1.500000e+01 4.500000e+01 6.000000e+01 <U+2587><U+2581><U+2582><U+2582><U+2583>
availability_90 0 1.00 3.871000e+01 35.80 0.000000e+00 0.000000e+00 3.900000e+01 7.400000e+01 9.000000e+01 <U+2587><U+2581><U+2582><U+2583><U+2585>
availability_365 0 1.00 1.648600e+02 136.38 0.000000e+00 9.000000e+00 1.515000e+02 3.080000e+02 3.650000e+02 <U+2587><U+2583><U+2582><U+2583><U+2586>
number_of_reviews 0 1.00 3.340000e+01 66.33 0.000000e+00 0.000000e+00 4.000000e+00 3.400000e+01 8.530000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 3.140000e+00 7.65 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 1.450000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_l30d 0 1.00 6.500000e-01 1.76 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.040000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 4417 0.73 4.500000e+00 0.75 0.000000e+00 4.400000e+00 4.670000e+00 4.890000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 4582 0.72 4.640000e+00 0.54 0.000000e+00 4.550000e+00 4.800000e+00 4.960000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 4579 0.72 4.580000e+00 0.56 0.000000e+00 4.470000e+00 4.730000e+00 4.930000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 4585 0.72 4.740000e+00 0.48 0.000000e+00 4.690000e+00 4.890000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 4580 0.72 4.730000e+00 0.50 0.000000e+00 4.670000e+00 4.880000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 4586 0.72 4.750000e+00 0.39 0.000000e+00 4.680000e+00 4.860000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 4587 0.72 4.500000e+00 0.54 0.000000e+00 4.360000e+00 4.620000e+00 4.810000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
calculated_host_listings_count 0 1.00 1.697000e+01 32.72 1.000000e+00 1.000000e+00 3.000000e+00 1.600000e+01 1.770000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_entire_homes 0 1.00 1.445000e+01 32.26 0.000000e+00 0.000000e+00 1.000000e+00 1.200000e+01 1.770000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_private_rooms 0 1.00 2.320000e+00 9.07 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 9.000000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_shared_rooms 0 1.00 8.000000e-02 0.82 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.500000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
reviews_per_month 4417 0.73 1.160000e+00 1.42 1.000000e-02 1.800000e-01 6.900000e-01 1.660000e+00 2.400000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>

3.2 Answering our EDA Questions - some of the steps we have taken below

  • Some of the pieces of data are of the wrong type or contain irrelevant data. For example, “price” is not only a “character” variable, but also contains the troublesome “$” symbol, which must be removed before proper analysis can begin. Also, the “bathrooms_text” data seems to be of use to our analysis, but first we need to once again turn it into a double-type data and remove the pesky text from it.

  • Some of the listings are located in irrelevant cities or countries. The “host_location” variable has observations of listings which are not located in Barcelona, and as such we must filter out pieces of data from irrelevant locations in the data frame.

  • There are some variables that seem rather irrelevant. For example, the “last_scraped” data, “name” and other such data seem a bit useless for analytical purposes, so we have to filter them out.

  • There are NA values which we must omit for our analysis to be as accurate as possible.

  • For the convenience of coding, comparing data and analysing, we converted the following 4 variables to numeric: price, bathrooms_text, host_response_rate, host_acceptance_rate.

  • We created “property_type” by grouping data into 5 categories based on frequency distribution to control from the largest subgroups. Our 5 categories are “Entire rental unit”,“Private room in rental unit”, “Entire serviced apartment”,“Entire condominium (condo)”, “Other”.

  • We created “neighbourhood_group_cleansed” by categorizing neighbourhoods into 4 different categories: North, Center, Coastal Line and West.

  • In addition, we created 5 data frames: host_specific, property_specific, reviews_specific, logical_specific, categorical_specific. For each of the 5 data frames, we have an overview of distribution and characteristics of each variable to see which variables might have the strongest power of predicting the price.

  • In the correlation matrix, blue represents for positive correlation and red represents for negative correlation. The darker the color is, the stronger the correlation between the variables. From our correlation matrix graph, we can see that the correlations between any two of bedroom, beds and accommodates are positive and relatively high. Also, another group of variables has relatively high positive correlation is the review-related variables, such as review scores cleanliness, review scores checkin, review scores communication, review scores location and review scores value. I think it is reasonable since these variables all describe reviews so they are somehow link to each other. Otherwise, variables have relatively weak negative correlation as indicated by very light red colour.

  • Price is the dependent variable, while factors that might influence the price and people’s preferences regarding the Barcelona AirBnB are independent variable, for example, property type, number of reviews, neighbourhood and so on.

3.3 Creating our working data frame: Barcelona

We start out by cleansing out the entire data frame for only entries in which “Barcelona” is mentioned so that henceforth we know we are only wrangling relevant geographical data.

barcelona <- listings %>% 
  filter(host_location %in% c("Barcelona", 
                              "BARCELONA", 
                              "Barcelona, Barcelona, Spain", 
                              "Barcelona, BARCELONA, Spain",
                              "Barcelona, Catalonia, Spain",
                              "Barcelona, Cataluña, Spain",
                              "Barcelona, Catalunya, Spain")) 

Having done this, we can now filter out the data frame for specific variables that we wish to analyse, turn the problematic pieces of data into their correct data types, and remove the NA data from our frame (thus fixing the other issues mentioned above). After thorough consideration, we decided that the factors which are most interesting and relevant to our EDA are in relation to the characteristics of the hosts themselves and their properties.

barcelona <- listings %>%  
  select(
         id,
         host_response_rate,
         host_acceptance_rate,
         host_is_superhost,
         host_listings_count,
         host_has_profile_pic,
         host_identity_verified,
         neighbourhood_cleansed,
         has_availability,
         instant_bookable,
         neighbourhood_group_cleansed,
         latitude,
         longitude,
         property_type,
         room_type,
         property_type,
         accommodates,
         bathrooms_text,
         bedrooms,
         beds,
         price,
         minimum_nights,
         maximum_nights,
         availability_30,
         number_of_reviews,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         reviews_per_month,
         instant_bookable
         )


# We convert the price from character to numeric
barcelona$price <- as.numeric(gsub("\\$","",barcelona$price))

#We convert bathrooms_text into a variable with an integer number of bathrooms
barcelona$bathrooms <- substr(barcelona$bathrooms_text, 1,2)
barcelona$bathrooms <- as.numeric(gsub("\\.","",barcelona$bathrooms))

#We convert the host_response_rate variable to numeric
barcelona$host_response_rate <- as.numeric(gsub("\\%","",barcelona$host_response_rate))
barcelona$host_acceptance_rate <- as.numeric(gsub("\\%","",barcelona$host_acceptance_rate))

#We convert the host_response_rate variable to numeric
barcelona$host_acceptance_rate <- as.numeric(gsub("\\%","",barcelona$host_acceptance_rate))
barcelona$host_acceptance_rate <- as.numeric(gsub("\\%","",barcelona$host_acceptance_rate))

#omit NA variables
barcelona <- na.omit(barcelona)
skim(barcelona)
Data summary
Name barcelona
Number of rows 8159
Number of columns 33
_______________________
Column type frequency:
character 5
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 44 0 69 0
neighbourhood_group_cleansed 0 1 6 19 0 10 0
property_type 0 1 4 35 0 46 0
room_type 0 1 10 15 0 4 0
bathrooms_text 0 1 6 16 0 26 0

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.22 FAL: 6342, TRU: 1817
host_has_profile_pic 0 1 1.00 TRU: 8148, FAL: 11
host_identity_verified 0 1 0.90 TRU: 7321, FAL: 838
has_availability 0 1 1.00 TRU: 8154, FAL: 5
instant_bookable 0 1 0.48 FAL: 4229, TRU: 3930

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 24747418.28 16027052.03 18674.00 9825059.50 24344024.00 39353745.50 51939670.00 <U+2587><U+2585><U+2586><U+2586><U+2586>
host_response_rate 0 1 92.23 17.00 0.00 92.00 100.00 100.00 100.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
host_acceptance_rate 0 1 85.73 24.48 0.00 83.00 97.00 100.00 100.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
host_listings_count 0 1 22.15 40.71 0.00 2.00 5.00 21.00 411.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1 41.39 0.01 41.35 41.38 41.39 41.40 41.46 <U+2581><U+2587><U+2585><U+2581><U+2581>
longitude 0 1 2.17 0.02 2.09 2.16 2.17 2.18 2.23 <U+2581><U+2582><U+2587><U+2585><U+2581>
accommodates 0 1 3.97 2.33 1.00 2.00 4.00 5.00 16.00 <U+2587><U+2583><U+2581><U+2581><U+2581>
bedrooms 0 1 1.85 1.04 1.00 1.00 2.00 2.00 10.00 <U+2587><U+2582><U+2581><U+2581><U+2581>
beds 0 1 2.71 2.03 0.00 1.00 2.00 4.00 20.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
price 0 1 107.37 89.89 9.00 47.00 91.00 134.00 999.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1 9.38 16.73 1.00 1.00 3.00 5.00 600.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1 740.52 482.78 1.00 180.00 1125.00 1125.00 3000.00 <U+2585><U+2587><U+2581><U+2581><U+2581>
availability_30 0 1 10.17 10.12 0.00 0.00 8.00 18.00 30.00 <U+2587><U+2583><U+2582><U+2582><U+2582>
number_of_reviews 0 1 52.33 78.92 1.00 4.00 18.00 68.00 853.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 0 1 4.55 0.51 1.00 4.41 4.67 4.87 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 0 1 4.64 0.51 1.00 4.53 4.78 4.93 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 0 1 4.59 0.52 1.00 4.47 4.72 4.91 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 0 1 4.72 0.46 1.00 4.67 4.87 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 0 1 4.71 0.49 1.00 4.66 4.86 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 0 1 4.76 0.36 1.00 4.69 4.86 5.00 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 0 1 4.49 0.52 1.00 4.34 4.60 4.79 5.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
reviews_per_month 0 1 1.35 1.51 0.01 0.30 0.93 1.91 24.00 <U+2587><U+2581><U+2581><U+2581><U+2581>
bathrooms 0 1 1.33 0.62 0.00 1.00 1.00 2.00 11.00 <U+2587><U+2581><U+2581><U+2581><U+2581>

Having now pulled the specific variables we want to analyse in our data frame and tidy it up a bit, we then proceed to separate property types based on specific subgroups.

#we group the property types into 5 categories according to frequency distribution to only control for the largest subgroups
barcelona <- barcelona %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit", "Entire serviced apartment","Entire condominium (condo)") ~ property_type, 
    TRUE ~ "Other"
  ))

#We check whether the categorization has been performed correctly
barcelona %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n)) %>% 
  kbl(col.names = c("Property Type", "Category", "Count"),caption = "Count by Property Type") %>% 
  kable_styling() 
Count by Property Type
Property Type Category Count
Entire rental unit Entire rental unit 4846
Private room in rental unit Private room in rental unit 1929
Entire serviced apartment Entire serviced apartment 271
Entire condominium (condo) Entire condominium (condo) 155
Entire loft Other 129
Room in boutique hotel Other 123
Private room in condominium (condo) Other 103
Private room in hostel Other 85
Private room in residential home Other 65
Room in hotel Other 64
Shared room in hostel Other 55
Entire residential home Other 39
Private room in loft Other 30
Private room in bed and breakfast Other 27
Entire guest suite Other 26
Private room in guest suite Other 25
Room in hostel Other 25
Room in serviced apartment Other 23
Shared room in rental unit Other 15
Private room in serviced apartment Other 14
Boat Other 12
Entire townhouse Other 12
Entire guesthouse Other 11
Private room in casa particular Other 11
Entire villa Other 7
Private room Other 7
Room in aparthotel Other 5
Room in bed and breakfast Other 5
Shared room in bed and breakfast Other 5
Private room in dome house Other 4
Private room in floor Other 4
Private room in townhouse Other 4
Private room in villa Other 4
Private room in chalet Other 3
Private room in guesthouse Other 3
Casa particular Other 2
Tiny house Other 2
Barn Other 1
Entire home/apt Other 1
Entire place Other 1
Private room in boat Other 1
Private room in farm stay Other 1
Shared room in guesthouse Other 1
Shared room in hotel Other 1
Shared room in loft Other 1
Shared room in serviced apartment Other 1
#We exclude listings that have minimum nights requirement greater than 4
barcelona <- barcelona %>% 
  filter(minimum_nights<=4)

#Here we categorize the grouped neighbourhoods in Barcelona in 4 different categories: North, Center, Coastal Line, and West
barcelona <- barcelona %>% 
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_group_cleansed %in% c("Horta-Guinardó","Nou Barris","Sarrià-Sant Gervasi") ~ "North",
    neighbourhood_group_cleansed %in% c("Eixample", "Gracia","Ciutat Vella") ~ "Center",
    neighbourhood_group_cleansed %in% c("Sant Martí","Sants-Montjuïc") ~ "Coastal line outside Center",
    TRUE ~ "West"
  ))

barcelona %>%
  count(neighbourhood_group_cleansed, neighbourhood_simplified) %>%
  arrange(desc(n))%>% 
  kbl(col.names = c("Neighbourhood", "Neighbourhood's Category", "Count"),caption = "Count and Category by Neighbourhood") %>% 
  kable_styling()
Count and Category by Neighbourhood
Neighbourhood Neighbourhood’s Category Count
Eixample Center 2670
Ciutat Vella Center 968
Sants-Montjuïc Coastal line outside Center 689
Sant Martí Coastal line outside Center 599
Gràcia West 518
Sarrià-Sant Gervasi North 186
Horta-Guinardó North 160
Les Corts West 108
Sant Andreu West 57
Nou Barris North 45

Now, to make our analysis more comprehensive, we further split our data frame into 5 different categories:

  • Host Specific: Data frame looking specifically at data relating to hosts themselves
  • Property Specific: Data frame looking at specific characteristics of the properties
  • Reviews Specific: Data frame looking at how users reviewed specific aspects of these properties and hosts
  • Logical Specific: Data frame analysing whether listings meet specific requirements
  • Categorical Specific: Data frame looking at the different neighbourhoods and types of properties we listed above
host_specific <- barcelona %>% 
  select(host_response_rate,
         host_acceptance_rate,
         host_listings_count
         )

property_specific <- barcelona %>% 
  select(accommodates,
         bedrooms,
         beds,
         bathrooms,
         )

reviews_specific <- barcelona %>% 
  select(number_of_reviews,
         review_scores_accuracy,
         review_scores_checkin,
         review_scores_cleanliness,
         review_scores_location,
         review_scores_communication,
         review_scores_value,
         review_scores_rating
         )

logical_specific <- barcelona %>% 
  select(host_is_superhost,
         host_has_profile_pic,
         host_identity_verified,
         has_availability,
         instant_bookable)

categorical_specific <- barcelona %>% 
  select(neighbourhood_group_cleansed,
         prop_type_simplified,
         room_type)

Firstly, we look at the property-related variables, mutate the data frames we created for these to include new variables explaining how well the properties accommodate guests in terms of bathrooms, and then create density charts to visualise the sizes of the properties which we are exploring.

property_specific <- property_specific %>% 
  mutate(bathrooms_per_guest=bathrooms/accommodates,
         bedrooms_per_guest=bedrooms/accommodates)


property_specific_longer <- property_specific %>% 
  select(bathrooms_per_guest,
         bedrooms_per_guest,
         bathrooms,
         bedrooms,
         beds,
         accommodates) %>% 
  pivot_longer(names_to="variable_name", values_to="values",everything())

ggplot(property_specific_longer, aes(x=values), na.rm=TRUE)+
  geom_density(fill="grey")+
  facet_wrap(vars(variable_name), scales="free", ncol=3)+
  labs(title="Property related variables affecting prices of Airbnb in Barcelona", x="", y="Density")+
  theme(axis.title = element_text()) + 
  theme(axis.text.y=element_blank())+
  theme_bw()+
  NULL

ggpairs(property_specific)

Looking at the distribution, we see that beds and accommodates seem to have a similar distribution. Thus, they might be correlated, if that’s the case we should only use one of them for our model. The other distributions are not unified, thus they might have an influence on prices.

In general, we can see a high positive correlation between bedrooms and beds, roughly 0.8, which suggests that we would better include one of them at a time for our model prediction. There is relatively low negative correlation between bedrooms/ bathrooms and number of bedrooms/ bathrooms per guest. Thus, we could include both at the same time in our model.

barcelona$host_response_rate <- barcelona$host_response_rate/100
barcelona$host_acceptance_rate <- barcelona$host_acceptance_rate/100

host_specific <- barcelona %>% 
  select(host_response_rate,
         host_acceptance_rate,
         host_listings_count,
         ) %>% 
  pivot_longer(names_to= "data_names", values_to="data_values", everything())

ggplot(host_specific, aes(x = data_values), na.rm=TRUE) +
  geom_density(fill = "blue") +
  facet_wrap(vars(data_names), scales="free") +
  labs(x = "", y = "Density", title = "Host-Specific Variables affecting Prices of AirBnB in Barcelona") +
  theme_bw() +
  NULL

ggpairs(host_specific)

Looking at these charts, we can see that the listings count is very unified, thus it might not influence prices. Acceptance rate seems to be the least unified, hence it could be interesting for our regression model. Last, we can note that the response rate shows a similar pattern to the acceptance rate, so they might be correlated.

reviews_specific %>%
  pivot_longer(names_to= "data_names", values_to="data_values", everything())%>%
  ggplot(aes(x = data_values), na.rm=TRUE) +
  geom_density(fill = "red",alpha = 0.4) +
  facet_wrap(vars(data_names), scales="free") +
  labs(x = "", y = "Density", title = "Reviews-Specific Variables affecting Prices of AirBnB in Barcelona") +
  theme_bw() +
  NULL

ggpairs(reviews_specific)

All the distributions are extremely skewed to the left. Most of them are very unified and might thus not be interesting to include in our model. The least unified variables that could be considered are: review score cleanliness, review score rating and review score value.

From the ggpair, we see there are high positive correlations among pairs of any two of the reviews_specific variables, except for the very first row, number of reviews. This is reasonable because review_scores_cleanliness, review_scores_communication, review_scores_value and review_scores_raing are somehow related to each other. Thus, we would better to include only one of these variables in our model prediction. (Note that, although we include both review_scores_rating and review_scores_value in our best model, we would check their collinearity again when we build our model to ensure there is no significant collinear relationships between our variables.)

#rename logical variables
colnames(logical_specific) <- c("Host is a Superhost", "Host has a Profile Picture", "Host's Identity is Verified", "Property is Available", "Property is Instantly Bookable")

#create table to better visualise logical variables
logical_visualisation <- logical_specific %>%
  
  #pivot table to perform next steps
  pivot_longer(cols = 1:5,
               names_to = "var",
               values_to = "logical") %>% 
  
  #group by variable and logical result (True or False)
  group_by(var, logical) %>%
  
  #count number of True and False for each variable
  summarise(count = n()) %>% 
  
  #pivot wider to have one row per variable and one column for each "True" and "False" count
  pivot_wider(names_from = "logical",
              values_from = "count") %>%
  
  #Format table with kableExtra, renmaing columns, adding title
  kbl(col.names = c("Logical Variable", "Count of True", "Count of False"), caption = "Count of True and False by Logical Variable")%>% 
  kable_styling()


#display table
logical_visualisation
Count of True and False by Logical Variable
Logical Variable Count of True Count of False
Host’s Identity is Verified 541 5459
Host has a Profile Picture 7 5993
Host is a Superhost 4720 1280
Property is Available 5 5995
Property is Instantly Bookable 2787 3213

Looking at the above table and considering what each variable represents we can conclude that: - most flats are not available, this potentially reflects a high demand and could influence prices (for instance the cheapest flats or the flats with the best price/quality balance are probably already booked) - a majority (~55%) of flats are instantly bookable, however, the distribution of True vs False is the most balanced of all variables. Consequently, this could have an important impact on the price. - almost all the hosts have a profile picture, thus this probably has no impact on the price. - a large majority (~80%) of hosts are superhosts, considering that superhosts are hosts who received great ratings, this could have a slight impact on the price. - a large majority (~80%) of hosts’ identities are verified, considering this is an indicator of security, it could slightly influence the price.

categorical_specific_longer <- categorical_specific %>%
  pivot_longer(names_to="variable_name", values_to="values",everything()) %>% 
  group_by(variable_name,values) %>% 
  mutate(observations=n())

ggplot(categorical_specific_longer, aes(y=reorder(values,observations)),
                                        na.rm=TRUE)+
  geom_bar(fill="grey",orientation = "y")+
  facet_wrap(vars(variable_name), scales="free", ncol=1)+
  labs(title="Categorical variables affecting prices of Airbnb in Barcelona",
             x="", y="Density")+
  theme_bw()+
  NULL

We can see that none of the variable is unified, thus they all have the potential to be useful in predicting variability in prices.

4 Mapping

4.1 Map of AirBnBs in Barcelona by Price

#we set the color for price to red and create the palette
price_color <- colorNumeric(palette = "Reds",
                            domain = c(1:200),
                            reverse = FALSE)

#We will use the listings data frame for the mapping, so we have to convert price to numeric here as well
listings$price <- as.numeric(gsub("\\$","",listings$price))

#Adding circle markers with popups and colors depending on price range
leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~price_color(price), 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

There is a direct correlation between property prices and proximity to the city center. There appear to be clusters of properties with distinctively higher prices, which could be explained by proximity to touristic locations or transportation connections

location_clusters <-leaflet(data = filter(listings, minimum_nights <= 4)) %>%
  
  # Map with OpenStreetMap.Mapnik
  addProviderTiles("OpenStreetMap.Mapnik") %>% 

  # Add circle markers with popups and labels
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type,
                   
                   # cluster 
                   clusterOptions = markerClusterOptions())
#print location_clusters
location_clusters

There appears to be a direct correlation between the number of properties and the proximity to the city center. Largest clustering in “Ciutat Vela” and ”Eixample” which are known mainly for their popular tourist spots and renowned architecture. At the same time, areas closer to the center boast easier transportation connections to most Barcelona neighbourhoods and viccinity to nightlife avenues.

5 Regression Analysis

5.1 Creating the price_4_nights variable

We create price variable price_4_nights and log it, for our model regression.

barcelona_analysis <- barcelona %>% 
  
  #filter for 2 people
  filter(accommodates >= 2) %>%
  
  #calculate price for 4 nights 
  mutate(price_4_nights = price * 4,
         logprice_4_nights = log(price_4_nights))

#create desnity plot
ggplot(data = barcelona_analysis, aes(price_4_nights)) +
  geom_density()+
  #add title and axis titles
  labs(title = "Density Plot of the Price for 2 People for  4 Nights", 
       y = "Density", 
       x = "Price for 2 for 4 Nights")+
  #change theme
  theme_bw()+
  NULL

#create density plot with logarithmic scale
ggplot(data = barcelona_analysis, aes(logprice_4_nights)) +
  geom_density()+
  #add title and axis titles
  labs(title = "Density Plot of the Price for 2 People for 4 Nights - Logarithmic Scale", 
       y = "Density - Logarithmic", 
       x = "Price for 2 for 4 Nights")+
  #change theme
  theme_bw()+
  NULL

glimpse(barcelona)
Rows: 6,000
Columns: 35
$ id                           <dbl> 18674, 23197, 32711, 34981, 35379, 35388,~
$ host_response_rate           <dbl> 1.00, 1.00, 1.00, 1.00, 0.90, 0.90, 0.90,~
$ host_acceptance_rate         <dbl> 0.83, 0.70, 1.00, 0.83, 1.00, 1.00, 1.00,~
$ host_is_superhost            <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TR~
$ host_listings_count          <dbl> 35, 2, 5, 3, 4, 4, 4, 4, 1, 8, 5, 4, 10, ~
$ host_has_profile_pic         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ host_identity_verified       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ neighbourhood_cleansed       <chr> "la Sagrada Família", "el Besòs i el Mare~
$ has_availability             <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ instant_bookable             <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU~
$ neighbourhood_group_cleansed <chr> "Eixample", "Sant Martí", "Gràcia", "Ciut~
$ latitude                     <dbl> 41.40556, 41.41291, 41.40566, 41.37978, 4~
$ longitude                    <dbl> 2.17262, 2.22063, 2.17015, 2.17623, 2.150~
$ property_type                <chr> "Entire rental unit", "Entire rental unit~
$ room_type                    <chr> "Entire home/apt", "Entire home/apt", "En~
$ accommodates                 <dbl> 8, 6, 6, 9, 2, 1, 1, 2, 1, 4, 6, 6, 2, 3,~
$ bathrooms_text               <chr> "2 baths", "2 baths", "1.5 baths", "3 bat~
$ bedrooms                     <dbl> 3, 3, 2, 4, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1,~
$ beds                         <dbl> 6, 5, 3, 6, 1, 1, 1, 1, 1, 1, 0, 4, 1, 1,~
$ price                        <dbl> 121, 229, 144, 189, 41, 31, 31, 46, 36, 5~
$ minimum_nights               <dbl> 1, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,~
$ maximum_nights               <dbl> 1125, 300, 31, 365, 60, 60, 60, 60, 65, 3~
$ availability_30              <dbl> 1, 22, 0, 15, 0, 0, 0, 0, 14, 2, 0, 0, 20~
$ number_of_reviews            <dbl> 21, 52, 63, 154, 358, 309, 310, 260, 79, ~
$ review_scores_rating         <dbl> 4.40, 4.74, 4.31, 4.52, 4.72, 4.77, 4.83,~
$ review_scores_accuracy       <dbl> 4.55, 4.92, 4.27, 4.66, 4.83, 4.84, 4.92,~
$ review_scores_cleanliness    <dbl> 4.75, 4.92, 4.38, 4.62, 4.65, 4.76, 4.82,~
$ review_scores_checkin        <dbl> 4.80, 4.94, 4.81, 4.65, 4.92, 4.90, 4.93,~
$ review_scores_communication  <dbl> 4.90, 4.98, 4.79, 4.67, 4.93, 4.92, 4.91,~
$ review_scores_location       <dbl> 4.75, 4.67, 4.81, 4.72, 4.77, 4.71, 4.85,~
$ review_scores_value          <dbl> 4.30, 4.69, 4.40, 4.48, 4.74, 4.78, 4.83,~
$ reviews_per_month            <dbl> 0.23, 0.75, 0.61, 1.54, 4.06, 3.49, 3.61,~
$ bathrooms                    <dbl> 2, 2, 1, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,~
$ prop_type_simplified         <chr> "Entire rental unit", "Entire rental unit~
$ neighbourhood_simplified     <chr> "Center", "Coastal line outside Center", ~
correlation_matrix_numeric <- barcelona %>%
  select(host_acceptance_rate,  host_listings_count, latitude, longitude, accommodates, bedrooms, beds, price, minimum_nights, maximum_nights, availability_30, number_of_reviews, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, reviews_per_month, bathrooms) %>%
  cor()

corrplot(correlation_matrix_numeric, is.corr = FALSE,
         method="color",
         type="full",
         title="Correlation matrix of numerical variables",
         tl.cex=.6,
         tl.col="black",
         cl.ratio=.3)

Correlation <- cor(correlation_matrix_numeric , use="pairwise.complete.obs")

5.2 We will use the normally distributed logprice_4_nights for our analysis.

set.seed(10)
# We create train_test_barcelona that is 75% for training and 25% for testing
train_test_barcelona <- initial_split(barcelona_analysis, prop = 0.75) 
airbnb_listing_train <- training(train_test_barcelona)
airbnb_listing_test <- testing(train_test_barcelona)

5.3 Model 1

First, we fit a regression model called model1 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, and review_scores_ratingas explanatory variables.

model1 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = barcelona_analysis)
car::vif(model1)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.018706  4        1.002319
number_of_reviews    1.047886  1        1.023663
review_scores_rating 1.046606  1        1.023038
model1 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 341.7275137 1.084801 71.6735262 0.0000000 291.3261850 400.8486007
prop_type_simplifiedEntire rental unit 0.9783626 1.050812 -0.4413504 0.6589765 0.8877732 1.0781959
prop_type_simplifiedEntire serviced apartment 1.3557471 1.060666 5.1675163 0.0000002 1.2079092 1.5216792
prop_type_simplifiedOther 0.6453538 1.054056 -8.3188813 0.0000000 0.5820706 0.7155173
prop_type_simplifiedPrivate room in rental unit 0.3385796 1.052435 -21.1910134 0.0000000 0.3063017 0.3742589
number_of_reviews 0.9992673 1.000078 -9.4143290 0.0000000 0.9991148 0.9994198
review_scores_rating 1.1153453 1.014766 7.4472350 0.0000000 1.0837508 1.1478609

Considering that we use logprice_4_nights as our outcome variable. We use exp() to get the estimate, standard error and confidence interval for price_4_nights.

According to the p-value, all the explanatory variables except prop_type_simplifiedEntire rental unit are significant.

review_scores_rating - Coefficient of review_scores_rating in terms of price_4_nights is 1.1153. If review_scores_rating increases by 1, price_4_nights will increase by 11.53%.

prop_type_simplified - Coefficient of prop_type_simplifiedEntire rental unit, in terms of price_4_nights is 0.9784. If property type changes from Entire condominium (condo) to Entire rental unit, price_4_nights will decrease by 2.16%. - Coefficient of prop_type_simplifiedEntire serviced apartment, in terms of price_4_nights is 1.3557. If property type changes from Entire condominium (condo) to Entire serviced apartment, price_4_nights will increase by 35.57%. - Coefficient of prop_type_simplifiedOther, in terms of price_4_nights is 0.6454. If property type changes from Entire condominium (condo) to Other, price_4_nights will decrease by 35.46%. - Coefficient of prop_type_simplifiedPrivate room in rental unit, in terms of price_4_nights is 0.3386. If property type changes from Entire condominium (condo) to Private room in rental unit, price_4_nights will decrease by 66.14%.

model1 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.4401521 0.4395462 0.4863421 726.4484 0 6
autoplot(model1, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model1 is 0.4395, which means that the selected variables are able to explain 43.95% of the variation in prices. This is a good start but we will run other models and test other variables in order to improve this. Residuals of model1 do not obey Normal Distribution well, we can see some patterns, which let us think that we might be missing key explanatory variables for the prediction of the price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model1, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model1, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model1)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model1, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 3736.565 0.4401521
Testing 2676.658 0.4070132

Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is higher than the testing one.

5.4 Model 2

Then, we want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We fit model2 with a regression model called model1 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, review_scores_rating and room_rype as explanatory variables.

model2 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data = barcelona_analysis)
car::vif(model2)
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 12.946270  4        1.377267
number_of_reviews     1.061837  1        1.030455
review_scores_rating  1.047441  1        1.023446
room_type            12.943643  3        1.532296
model2 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 347.7648012 1.082191 74.0815208 0.0000000 297.8766839 406.0081352
prop_type_simplifiedEntire rental unit 0.9778766 1.049262 -0.4652412 0.6417770 0.8899052 1.0745443
prop_type_simplifiedEntire serviced apartment 1.3518788 1.058807 5.2761872 0.0000001 1.2086129 1.5121271
prop_type_simplifiedOther 0.9884007 1.063695 -0.1889427 0.8501446 0.8757105 1.1155924
prop_type_simplifiedPrivate room in rental unit 0.6362133 1.070603 -6.6286925 0.0000000 0.5565696 0.7272537
number_of_reviews 0.9991508 1.000076 -11.1725409 0.0000000 0.9990019 0.9992998
review_scores_rating 1.1131383 1.014329 7.5337097 0.0000000 1.0825209 1.1446216
room_typeHotel room 0.9623769 1.064049 -0.6177238 0.5367828 0.8520986 1.0869273
room_typePrivate room 0.5311568 1.048137 -13.4574829 0.0000000 0.4843899 0.5824389
room_typeShared room 0.2865566 1.095642 -13.6830693 0.0000000 0.2395765 0.3427493

According to p-value, room_type variables are all significant but room_typeHotel room. We see that changes to Hotel, Private or Shared room will result in a decrease in price of 3.8%, 46.9% and 71.4% respectively. This is coherent with what we would expect.

model2 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.4733089 0.4724534 0.4718483 553.2664 0 9
autoplot(model2, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model2 is 0.4718, which means that the selected variables are able to explain 47.18%% of the variation in prices. This is an improvement compared to model1 (43.95%). Residuals of model2 do not obey Normal Distribution well, we can see some patterns as for model2, which let us think that we might still be missing key explanatory variables for the prediction of the price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model2, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model2, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model2)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model2, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 3593.001 0.4733089
Testing 2551.835 0.4354227

Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is higher than the testing one.

5.5 Model 3

Our dataset has many more variables, so we are trying to find whether other variables are significant predictors of price_4_nights.

We want to know if bathrooms, bedrooms, beds, accomodates are significant predictors of price_4_nights. We fit model3 with a regression model called model1 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, review_scores_rating, room_rype, bathrooms, bedrooms, beds and accomodatesas explanatory variables.

model3 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + bedrooms + beds + accommodates, 
             data = barcelona_analysis)
car::vif(model3)
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.313991  4        1.382097
number_of_reviews     1.067452  1        1.033176
review_scores_rating  1.052516  1        1.025922
room_type            14.141768  3        1.555072
bathrooms             1.825769  1        1.351210
bedrooms              4.299252  1        2.073464
beds                  4.399205  1        2.097428
accommodates          5.699291  1        2.387319
model3 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 179.9050961 1.074915 71.8758554 0.0000000 156.1485040 207.2760403
prop_type_simplifiedEntire rental unit 0.9609921 1.043610 -0.9321436 0.3513030 0.8838485 1.0448689
prop_type_simplifiedEntire serviced apartment 1.3452082 1.052042 5.8452345 0.0000000 1.2178553 1.4858785
prop_type_simplifiedOther 0.9625259 1.056353 -0.6966919 0.4860249 0.8644448 1.0717353
prop_type_simplifiedPrivate room in rental unit 0.6267396 1.062494 -7.7075651 0.0000000 0.5565151 0.7058255
number_of_reviews 0.9993282 1.000068 -9.9360345 0.0000000 0.9991957 0.9994607
review_scores_rating 1.1210482 1.012733 9.0308484 0.0000000 1.0935836 1.1492026
room_typeHotel room 1.2218695 1.057051 3.6115794 0.0003070 1.0959408 1.3622679
room_typePrivate room 0.7090163 1.043362 -8.1011432 0.0000000 0.6524039 0.7705412
room_typeShared room 0.2876998 1.086365 -15.0395481 0.0000000 0.2445754 0.3384280
bathrooms 1.1447843 1.011703 11.6212057 0.0000000 1.1189676 1.1711967
bedrooms 1.0438326 1.010597 4.0695368 0.0000478 1.0224826 1.0656284
beds 0.9931130 1.005456 -1.2700306 0.2041271 0.9825754 1.0037636
accommodates 1.0698846 1.005727 11.8297469 0.0000000 1.0579748 1.0819286

According to the p-value, beds is non-significant variable and we should remove it from the model. Other variables are significant. An increase (by 1) in the number of accomodates, bedrooms and bathroomsall result in an increase in price of 6.99%, 4.38% and 14.48% respectively.

To make GVIFs comparable across variables, we can use the Square of GVIF^(1/(2*DF)). This reduces GVIF to a linear measure. If the calculated value is less than 5, then we can conclude that their is no collinearity. accomodates has Square of (2.352034)^2 = 5.7 >5, indicating collinearity. Therefore, we need to remove it from the model.

model3 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.5857428 0.5847702 0.4186164 602.2379 0 13
autoplot(model3, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model3 is 0.5848, which means that the selected variables are able to explain 58.48% of the variation in prices. This is an improvement compared to model2 (47.18%). Residuals of model2 roughly obey Normal Distribution well, we can see less patterns than for model2, but we might still be missing explanatory variables to improve the prediction of the price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model3, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model3, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model3)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model3, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 2710.236 0.5857428
Testing 1897.556 0.5968172

Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.

5.6 Model 4

We want to know if host_is_superhostcommands a pricing premium, after controlling for other variables. We fit model4 with a regression model called model4 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, review_scores_rating, room_rype, bathrooms, and host_is_superhostas explanatory variables.

model4 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + host_is_superhost, 
             data = barcelona_analysis)
car::vif(model4)
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.057355  4        1.378738
number_of_reviews     1.129981  1        1.063006
review_scores_rating  1.113245  1        1.055104
room_type            13.172275  3        1.536774
bathrooms             1.059876  1        1.029503
host_is_superhost     1.159588  1        1.076841
model4 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 240.0513800 1.077429 73.4915368 0.0000000 207.4003143 277.8427083
prop_type_simplifiedEntire rental unit 0.9951711 1.045371 -0.1090917 0.9131337 0.9122628 1.0856142
prop_type_simplifiedEntire serviced apartment 1.3201377 1.054113 5.2701927 0.0000001 1.1905609 1.4638173
prop_type_simplifiedOther 0.9454133 1.058589 -0.9858815 0.3242343 0.8455637 1.0570539
prop_type_simplifiedPrivate room in rental unit 0.5977301 1.064920 -8.1814936 0.0000000 0.5283878 0.6761723
number_of_reviews 0.9992442 1.000072 -10.4643353 0.0000000 0.9991027 0.9993857
review_scores_rating 1.0939414 1.013603 6.6455171 0.0000000 1.0653468 1.1233035
room_typeHotel room 1.1047255 1.059039 1.7362935 0.0825675 0.9872276 1.2362079
room_typePrivate room 0.6264124 1.044590 -10.7222237 0.0000000 0.5750683 0.6823408
room_typeShared room 0.2915437 1.087809 -14.6445338 0.0000000 0.2471986 0.3438437
bathrooms 1.3346095 1.009247 31.3580852 0.0000000 1.3107429 1.3589106
host_is_superhostTRUE 1.0448375 1.015416 2.8670650 0.0041586 1.0139672 1.0766476

According to the p-value, host_is_superhostis significant. It also commands a small pricing premium, after controlling for other variables (4.48%).

model4 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.5532422 0.5523549 0.4346491 623.5651 0 11
autoplot(model4, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model4 is 0.5524, which means that the selected variables are able to explain 55.24% of the variation in prices. This is a decrease compared to model3 (58.48%), meaning that host_is_superhost should probably not be included. Residuals of model4 display patterns again, which let us think that we might still be missing key explanatory variables for the prediction of the price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model4, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model4, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model4)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model4, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 3087.457 0.5532422
Testing 1940.272 0.5633311

Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.

5.7 Model 5

model5 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms +  instant_bookable, 
             data = barcelona_analysis)
car::vif(model5)
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.136632  4        1.379782
number_of_reviews     1.069493  1        1.034163
review_scores_rating  1.049064  1        1.024238
room_type            13.249696  3        1.538276
bathrooms             1.061395  1        1.030240
instant_bookable      1.029694  1        1.014738
model5 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 225.0470317 1.077086 72.9376580 0.0000000 194.5582837 260.3135961
prop_type_simplifiedEntire rental unit 0.9828325 1.045267 -0.3911397 0.6957090 0.9011284 1.0719445
prop_type_simplifiedEntire serviced apartment 1.3020667 1.054034 5.0157923 0.0000005 1.1744365 1.4435671
prop_type_simplifiedOther 0.9365372 1.058506 -1.1531391 0.2489031 0.8377534 1.0469691
prop_type_simplifiedPrivate room in rental unit 0.5991242 1.064860 -8.1518208 0.0000000 0.5296792 0.6776739
number_of_reviews 0.9993097 1.000070 -9.8324013 0.0000000 0.9991721 0.9994473
review_scores_rating 1.1063802 1.013190 7.7148642 0.0000000 1.0783209 1.1351696
room_typeHotel room 1.0791701 1.059163 1.3255666 0.1850380 0.9641685 1.2078884
room_typePrivate room 0.6218689 1.044571 -10.8935827 0.0000000 0.5709173 0.6773677
room_typeShared room 0.2893103 1.087699 -14.7535725 0.0000000 0.2453535 0.3411424
bathrooms 1.3365026 1.009245 31.5182290 0.0000000 1.3126068 1.3608334
instant_bookableTRUE 1.0522043 1.011955 4.2819456 0.0000188 1.0279738 1.0770059

instant_bookable is significant according to its p-value. Also a change in that logical variable from False to True results in and 5.22% increase in price.

model5 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.5540553 0.5531697 0.4342534 625.6203 0 11
autoplot(model5, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model5 is 0.5532, which means that the selected variables are able to explain 55.32% of the variation in prices. This is a slight increase compared to model4 (55.24%), meaning that instant_bookable should probably be included. Residuals of model5 still display patterns, which let us think that we might still be missing key explanatory variables for the prediction of the price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model5, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model5, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model5)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model5, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 3082.670 0.5540553
Testing 1933.803 0.5639097

Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.

5.8 Model 6

model6 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + instant_bookable + neighbourhood_simplified, 
             data = barcelona_analysis)
car::vif(model6)
                              GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     13.339522  4        1.382428
number_of_reviews         1.070205  1        1.034507
review_scores_rating      1.050618  1        1.024997
room_type                13.444714  3        1.542027
bathrooms                 1.072098  1        1.035421
instant_bookable          1.032118  1        1.015932
neighbourhood_simplified  1.041735  3        1.006838
model6 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 248.8618797 1.076417 74.9191495 0.0000000 215.4090132 287.5099525
prop_type_simplifiedEntire rental unit 0.9817790 1.044631 -0.4211485 0.6736630 0.9012362 1.0695197
prop_type_simplifiedEntire serviced apartment 1.2799448 1.053309 4.7522225 0.0000021 1.1560392 1.4171307
prop_type_simplifiedOther 0.9761745 1.057768 -0.4293704 0.6676704 0.8744049 1.0897888
prop_type_simplifiedPrivate room in rental unit 0.6248330 1.064002 -7.5804872 0.0000000 0.5532818 0.7056373
number_of_reviews 0.9993208 1.000069 -9.8121777 0.0000000 0.9991851 0.9994565
review_scores_rating 1.1011566 1.013009 7.4552807 0.0000000 1.0736055 1.1294148
room_typeHotel room 1.0044313 1.058614 0.0776248 0.9381293 0.8983072 1.1230927
room_typePrivate room 0.5939059 1.044081 -12.0786472 0.0000000 0.5457471 0.6463143
room_typeShared room 0.2655677 1.086686 -15.9490452 0.0000000 0.2256302 0.3125743
bathrooms 1.3214794 1.009158 30.5772085 0.0000000 1.2980723 1.3453086
instant_bookableTRUE 1.0453513 1.011796 3.7820136 0.0001572 1.0215926 1.0696625
neighbourhood_simplifiedCoastal line outside Center 0.8679975 1.014634 -9.7443160 0.0000000 0.8436250 0.8930742
neighbourhood_simplifiedNorth 0.8170994 1.024871 -8.2222764 0.0000000 0.7786801 0.8574143
neighbourhood_simplifiedWest 0.8676094 1.019039 -7.5298560 0.0000000 0.8361169 0.9002881

The neighbourhood variables are all significant according to p-value. Changing to North, Coastal outside Center or West will induce a decrease in price of 18.29%, 13.20%, 13.24% respectively.

model6 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.5670004 0.5659054 0.4280201 517.8023 0 14
autoplot(model6, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model6 is 0.5659, which means that the selected variables are able to explain 56.59% of the variation in prices. This is a slight increase compared to model5 (55.32%), meaning that neighbourhood variables should probably be included. Residuals of model5 still display patterns but less, which let us think that we might still be missing key explanatory variables for the prediction of the price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model6, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model6, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model6)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model6, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 3006.856 0.5670004
Testing 1896.154 0.5705083

Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.

5.9 Model 7

We want to know the effect of avalability_30 or reviews_per_month on price_4_nights, after we control for other variables.

model7 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + instant_bookable + neighbourhood_simplified + availability_30 + reviews_per_month,
             data = barcelona_analysis)
car::vif(model7)
                              GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     13.667818  4        1.386636
number_of_reviews         1.364425  1        1.168086
review_scores_rating      1.057701  1        1.028446
room_type                13.693264  3        1.546742
bathrooms                 1.076702  1        1.037643
instant_bookable          1.035937  1        1.017810
neighbourhood_simplified  1.054034  3        1.008809
availability_30           1.055266  1        1.027261
reviews_per_month         1.327783  1        1.152295
model7 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 222.2066430 1.074983 74.7334080 0.0000000 192.8400998 256.0452533
prop_type_simplifiedEntire rental unit 0.9694117 1.043443 -0.7305194 0.4651037 0.8918716 1.0536932
prop_type_simplifiedEntire serviced apartment 1.2325031 1.051877 4.1333509 0.0000363 1.1161652 1.3609669
prop_type_simplifiedOther 0.9643810 1.056159 -0.6637946 0.5068494 0.8664231 1.0734141
prop_type_simplifiedPrivate room in rental unit 0.6212936 1.062283 -7.8772842 0.0000000 0.5518936 0.6994205
number_of_reviews 0.9997678 1.000076 -3.0535164 0.0022725 0.9996187 0.9999169
review_scores_rating 1.1205088 1.012696 9.0191881 0.0000000 1.0931367 1.1485663
room_typeHotel room 0.9186459 1.057236 -1.5245670 0.1274243 0.8236856 1.0245538
room_typePrivate room 0.5982460 1.042917 -12.2260312 0.0000000 0.5509389 0.6496152
room_typeShared room 0.2502034 1.084310 -17.1166199 0.0000000 0.2134907 0.2932295
bathrooms 1.3083512 1.008927 30.2422213 0.0000000 1.2857541 1.3313455
instant_bookableTRUE 1.0511371 1.011495 4.3636186 0.0000130 1.0278476 1.0749543
neighbourhood_simplifiedCoastal line outside Center 0.8557375 1.014285 -10.9835164 0.0000000 0.8322703 0.8798665
neighbourhood_simplifiedNorth 0.8197943 1.024236 -8.2974532 0.0000000 0.7821975 0.8591982
neighbourhood_simplifiedWest 0.8626368 1.018544 -8.0418359 0.0000000 0.8321172 0.8942758
availability_30 1.0083305 1.000572 14.5063939 0.0000000 1.0072007 1.0094616
reviews_per_month 0.9583110 1.004137 -10.3137990 0.0000000 0.9505858 0.9660990

avalability_30 and reviews_per_monthare both significant according to p-value. Surprisingly an increase in reviews_per_month results in a 4.17% decrease in price. Increase in availability only results in a very slight increase in price.

model7 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.5904035 0.5892193 0.4163676 498.5536 0 16
autoplot(model7, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of model7 is 0.5892, which means that the selected variables are able to explain 58.92% of the variation in prices. This is a slight increase compared to model6 (56.59%), meaning that avalability_30 and reviews_per_month should probably be included. Residuals of model7 show an improvement, close to Normal Distribution and less patterns, which is good, howver we might still be able to improve the price’s prediction.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(model7, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(model7, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(model7)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(model7, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 2884.986 0.5904035
Testing 1781.748 0.5759478

Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is now higher than the testing one, which is a good sign.

6 Best Model

6.1 Setting and Analysis

best_model <- lm(logprice_4_nights ~ prop_type_simplified + review_scores_rating + room_type + bathrooms + bedrooms + accommodates + instant_bookable + neighbourhood_simplified + availability_30 + reviews_per_month + host_is_superhost + host_response_rate + host_acceptance_rate + review_scores_cleanliness + review_scores_checkin + review_scores_location + review_scores_value,
             data = barcelona_analysis)
msummary(best_model)
                                                      Estimate Std. Error
(Intercept)                                          5.1892383  0.0993132
prop_type_simplifiedEntire rental unit              -0.0371555  0.0408635
prop_type_simplifiedEntire serviced apartment        0.2539258  0.0485931
prop_type_simplifiedOther                            0.0018282  0.0524648
prop_type_simplifiedPrivate room in rental unit     -0.4156010  0.0579952
review_scores_rating                                 0.1452613  0.0261472
room_typeHotel room                                  0.0369883  0.0537111
room_typePrivate room                               -0.3810986  0.0406229
room_typeShared room                                -1.3119238  0.0792264
bathrooms                                            0.1281109  0.0111453
bedrooms                                             0.0524808  0.0095358
accommodates                                         0.0527500  0.0046754
instant_bookableTRUE                                 0.0458857  0.0123039
neighbourhood_simplifiedCoastal line outside Center -0.1451596  0.0138394
neighbourhood_simplifiedNorth                       -0.1692862  0.0234309
neighbourhood_simplifiedWest                        -0.1303869  0.0177368
availability_30                                      0.0069505  0.0005574
reviews_per_month                                   -0.0431788  0.0036257
host_is_superhostTRUE                                0.0763412  0.0140296
host_response_rate                                  -0.2639991  0.0388522
host_acceptance_rate                                 0.1193673  0.0294330
review_scores_cleanliness                            0.0736749  0.0185538
review_scores_checkin                               -0.0626868  0.0175955
review_scores_location                               0.0948212  0.0219234
review_scores_value                                 -0.1114280  0.0232620
                                                    t value Pr(>|t|)    
(Intercept)                                          52.251  < 2e-16 ***
prop_type_simplifiedEntire rental unit               -0.909 0.363253    
prop_type_simplifiedEntire serviced apartment         5.226 1.80e-07 ***
prop_type_simplifiedOther                             0.035 0.972204    
prop_type_simplifiedPrivate room in rental unit      -7.166 8.73e-13 ***
review_scores_rating                                  5.556 2.90e-08 ***
room_typeHotel room                                   0.689 0.491071    
room_typePrivate room                                -9.381  < 2e-16 ***
room_typeShared room                                -16.559  < 2e-16 ***
bathrooms                                            11.495  < 2e-16 ***
bedrooms                                              5.504 3.89e-08 ***
accommodates                                         11.283  < 2e-16 ***
instant_bookableTRUE                                  3.729 0.000194 ***
neighbourhood_simplifiedCoastal line outside Center -10.489  < 2e-16 ***
neighbourhood_simplifiedNorth                        -7.225 5.69e-13 ***
neighbourhood_simplifiedWest                         -7.351 2.25e-13 ***
availability_30                                      12.470  < 2e-16 ***
reviews_per_month                                   -11.909  < 2e-16 ***
host_is_superhostTRUE                                 5.441 5.51e-08 ***
host_response_rate                                   -6.795 1.20e-11 ***
host_acceptance_rate                                  4.056 5.07e-05 ***
review_scores_cleanliness                             3.971 7.25e-05 ***
review_scores_checkin                                -3.563 0.000370 ***
review_scores_location                                4.325 1.55e-05 ***
review_scores_value                                  -4.790 1.71e-06 ***

Residual standard error: 0.3991 on 5526 degrees of freedom
Multiple R-squared:  0.6241,    Adjusted R-squared:  0.6225 
F-statistic: 382.3 on 24 and 5526 DF,  p-value: < 2.2e-16
car::vif(best_model)
                               GVIF Df GVIF^(1/(2*Df))
prop_type_simplified      13.972116  4        1.390458
review_scores_rating       4.944021  1        2.223516
room_type                 14.707991  3        1.565281
bathrooms                  1.842621  1        1.357432
bedrooms                   3.869552  1        1.967118
accommodates               4.202473  1        2.049993
instant_bookable           1.306395  1        1.142976
neighbourhood_simplified   1.130983  3        1.020726
availability_30            1.090789  1        1.044408
reviews_per_month          1.114209  1        1.055561
host_is_superhost          1.156422  1        1.075371
host_response_rate         1.358279  1        1.165452
host_acceptance_rate       1.572721  1        1.254082
review_scores_cleanliness  2.508431  1        1.583803
review_scores_checkin      2.026785  1        1.423652
review_scores_location     1.854451  1        1.361782
review_scores_value        3.906312  1        1.976439
best_model %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), 
         std.error = exp(std.error),
         conf.low=exp(conf.low),
         conf.high=exp(conf.high)
         ) %>% 
    kbl(col.names=c("Variable",
                    "Estimate",
                    "SE",
                    "t-stat",
                    "p-value",
                    "Lower CI",
                    "Upper CI")
        ) %>%
  kable_styling()
Variable Estimate SE t-stat p-value Lower CI Upper CI
(Intercept) 179.3319056 1.104412 52.2512243 0.0000000 147.6058131 217.8771397
prop_type_simplifiedEntire rental unit 0.9635263 1.041710 -0.9092598 0.3632527 0.8893504 1.0438888
prop_type_simplifiedEntire serviced apartment 1.2890762 1.049793 5.2255577 0.0000002 1.1719445 1.4179148
prop_type_simplifiedOther 1.0018299 1.053865 0.0348460 0.9722038 0.9039119 1.1103550
prop_type_simplifiedPrivate room in rental unit 0.6599435 1.059710 -7.1661332 0.0000000 0.5890205 0.7394063
review_scores_rating 1.1563417 1.026492 5.5555156 0.0000000 1.0985624 1.2171598
room_typeHotel room 1.0376809 1.055180 0.6886525 0.4910709 0.9339741 1.1529030
room_typePrivate room 0.6831105 1.041459 -9.3813712 0.0000000 0.6308195 0.7397361
room_typeShared room 0.2693015 1.082449 -16.5591823 0.0000000 0.2305613 0.3145510
bathrooms 1.1366790 1.011208 11.4946330 0.0000000 1.1121129 1.1617878
bedrooms 1.0538824 1.009581 5.5035776 0.0000000 1.0343642 1.0737688
accommodates 1.0541661 1.004686 11.2825571 0.0000000 1.0445483 1.0638725
instant_bookableTRUE 1.0469548 1.012380 3.7293680 0.0001939 1.0220039 1.0725148
neighbourhood_simplifiedCoastal line outside Center 0.8648843 1.013936 -10.4888346 0.0000000 0.8417348 0.8886704
neighbourhood_simplifiedNorth 0.8442672 1.023708 -7.2249056 0.0000000 0.8063640 0.8839521
neighbourhood_simplifiedWest 0.8777557 1.017895 -7.3512104 0.0000000 0.8477597 0.9088131
availability_30 1.0069747 1.000557 12.4698972 0.0000000 1.0058750 1.0080757
reviews_per_month 0.9577402 1.003632 -11.9090355 0.0000000 0.9509568 0.9645719
host_is_superhostTRUE 1.0793308 1.014128 5.4414237 0.0000001 1.0500498 1.1094283
host_response_rate 0.7679742 1.039617 -6.7949653 0.0000000 0.7116532 0.8287526
host_acceptance_rate 1.1267837 1.029870 4.0555531 0.0000507 1.0636080 1.1937118
review_scores_cleanliness 1.0764568 1.018727 3.9708901 0.0000725 1.0380067 1.1163312
review_scores_checkin 0.9392376 1.017751 -3.5626657 0.0003702 0.9073919 0.9722010
review_scores_location 1.0994623 1.022166 4.3251185 0.0000155 1.0532101 1.1477456
review_scores_value 0.8945558 1.023535 -4.7901268 0.0000017 0.8546778 0.9362944

All variables but prop_type_simplifiedEntire rental unit, prop_type_simplifiedOther and room_typeHotel room are significant according to their p-values. Changes in these variables all result in either a decrease or an increase in price, which is more or less important.

best_model %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("R Squared",
                    "Adj. R Squared",
                    "Sigma",
                    "t-stat",
                    "p-value",
                    "Df")
        ) %>%
  kable_styling()
R Squared Adj. R Squared Sigma t-stat p-value Df
0.6241258 0.6224933 0.3991483 382.322 0 24
autoplot(best_model, 
         alpha = 0.2, 
         label.size = 3) +
  theme_minimal()

The adjusted Rsquared of our best model is 0.6225, which means that the selected variables are able to explain 62.25% of the variation in prices. This is a very good prediction model, higher than all models 1 to 7 that we obtained before. Moreover, the residuals of the model approximately follow a Normal Distribution, there is no clear pattern in the residuals. Thus, it seems that this model includes all key outcome variables to obtain a regression model that provides a good prediction of the analysed price.

#Constructing our predictions for the model and RMSE train from training data
train_rmse <- 
  airbnb_listing_train %>% 
  mutate(predictions = exp(predict(best_model, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
  airbnb_listing_test %>% 
  mutate(predictions = exp(predict(best_model, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#Training model output on RSquared
train_r_squared <-
  summary(best_model)$r.squared

#Testing model output on Rsquared
test_r_squared <-
  cor(predict(best_model, airbnb_listing_test), airbnb_listing_test$price_4_nights)

#Creating a matrix from output values
kbl(matrix(c(train_rmse, 
             test_rmse, 
             train_r_squared, 
             test_r_squared),
           nrow = 2,
           dimnames = list(c("Training","Testing"),
                           c("RMSE","RSquared"))),
  ) %>% 
  kable_styling()
RMSE RSquared
Training 2524.989 0.6241258
Testing 1695.712 0.6091837

Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is higher than the testing one, thus we do not have overfitting issues in our model.

6.2 Diagnostics, collinearity, summary tables

huxreg(model1, model2, model3, model4, model5, model6, model7, best_model,
       coefs=c("Property Type - Entire Unit"="prop_type_simplifiedEntire rental unit",
               "Property Type - Entire Unit"="prop_type_simplifiedEntire serviced apartment",
               "Property Type - Entire Unit"="prop_type_simplifiedPrivate room in rental unit",
               "Property Type - Entire Unit"="prop_type_simplifiedOther",
               "Number of Reviews"="number_of_reviews",
               "Rating Score" = "review_scores_rating",
               "Room Type - Hotel" = "room_typeHotel room",
               "Room Type - Private Room"="room_typePrivate room",
               "Room Type - Shared Room"="room_typeShared room",
               "Number of Bathrooms"="bathrooms",
               "Number of Bedrooms"="bedrooms",
               "Number of Beds"="beds",
               "Accommodates"="accommodates",
               "Host is Superhost - TRUE"="host_is_superhostTRUE",
               "Instant Bookable - TRUE"="instant_bookableTRUE",
               "Neighbourhood - Coastal Line outside Center" = "neighbourhood_simplifiedCoastal line outside Center",
               "Neighbourhood - West"="neighbourhood_simplifiedWest",
               "Neighbourhood - North"="neighbourhood_simplifiedNorth",
               "Availability next 30 days" = "availability_30",
               "Reviews per month" = "reviews_per_month"))
(1)(2)(3)(4)(5)(6)(7)(8)
Property Type - Entire Unit-0.022    -0.022    -0.040    -0.005    -0.017    -0.018    -0.031    -0.037    
(0.050)   (0.048)   (0.043)   (0.044)   (0.044)   (0.044)   (0.043)   (0.041)   
Number of Reviews-0.001 ***-0.001 ***-0.001 ***-0.001 ***-0.001 ***-0.001 ***-0.000 **         
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)           
Rating Score0.109 ***0.107 ***0.114 ***0.090 ***0.101 ***0.096 ***0.114 ***0.145 ***
(0.015)   (0.014)   (0.013)   (0.014)   (0.013)   (0.013)   (0.013)   (0.026)   
Room Type - Hotel        -0.038    0.200 ***0.100    0.076    0.004    -0.085    0.037    
        (0.062)   (0.055)   (0.057)   (0.057)   (0.057)   (0.056)   (0.054)   
Room Type - Private Room        -0.633 ***-0.344 ***-0.468 ***-0.475 ***-0.521 ***-0.514 ***-0.381 ***
        (0.047)   (0.042)   (0.044)   (0.044)   (0.043)   (0.042)   (0.041)   
Room Type - Shared Room        -1.250 ***-1.246 ***-1.233 ***-1.240 ***-1.326 ***-1.385 ***-1.312 ***
        (0.091)   (0.083)   (0.084)   (0.084)   (0.083)   (0.081)   (0.079)   
Number of Bathrooms                0.135 ***0.289 ***0.290 ***0.279 ***0.269 ***0.128 ***
                (0.012)   (0.009)   (0.009)   (0.009)   (0.009)   (0.011)   
Number of Bedrooms                0.043 ***                                0.052 ***
                (0.011)                                   (0.010)   
Number of Beds                -0.007                                            
                (0.005)                                           
Accommodates                0.068 ***                                0.053 ***
                (0.006)                                   (0.005)   
Host is Superhost - TRUE                        0.044 **                         0.076 ***
                        (0.015)                           (0.014)   
Instant Bookable - TRUE                                0.051 ***0.044 ***0.050 ***0.046 ***
                                (0.012)   (0.012)   (0.011)   (0.012)   
Neighbourhood - Coastal Line outside Center                                        -0.142 ***-0.156 ***-0.145 ***
                                        (0.015)   (0.014)   (0.014)   
Neighbourhood - West                                        -0.142 ***-0.148 ***-0.130 ***
                                        (0.019)   (0.018)   (0.018)   
Neighbourhood - North                                        -0.202 ***-0.199 ***-0.169 ***
                                        (0.025)   (0.024)   (0.023)   
Availability next 30 days                                                0.008 ***0.007 ***
                                                (0.001)   (0.001)   
Reviews per month                                                -0.043 ***-0.043 ***
                                                (0.004)   (0.004)   
N5551        5551        5551        5551        5551        5551        5551        5551        
R20.440    0.473    0.586    0.553    0.554    0.567    0.590    0.624    
logLik-3871.626    -3702.179    -3035.706    -3245.339    -3240.282    -3158.521    -3004.303    -2765.837    
AIC7759.251    7426.358    6101.412    6516.677    6506.565    6349.042    6044.605    5583.674    
*** p < 0.001; ** p < 0.01; * p < 0.05.

The best model is preferable to other choices, as it offers the highest Rsquared, does not present any overfitting issue and has residuals that do not show clear patterns. In conclusion, it includes the most important outcome variables and enables good prediction of the variable of interest which is the price.

6.3 Prediction

In this last part, we use our best model to predict the total cost of an Airbnb accommodation for a visit in Barcelona over reading week. More specifically, we want Airbnb’s in Barcelona that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90.

# we select: an Entire serviced apartment, which has at least 10 reviews, and an average rating of at least 4,5/5 to conduct our prediction
finally_done <- barcelona_analysis %>% 
  filter(property_type == 'Entire serviced apartment',
         review_scores_rating >= 4.5,
         number_of_reviews >= 10) %>% 
  
  # we take the exponential for prediction because we used log(y) in our model
  mutate(prediction = exp(predict(best_model, .)))
# we construct a confidence interval
upper_ci <- quantile(finally_done$prediction, 0.975)
lower_ci <- quantile(finally_done$prediction, 0.025)

# we calculate the mean
average_prediction <- mean(finally_done$prediction)

# we visualise
ggplot(finally_done, aes(x = prediction)) +
  geom_histogram() +
  geom_vline(xintercept = average_prediction, size = 0.7, color = 'red') + 
  geom_vline(xintercept = upper_ci, size = 0.7, color = 'blue') +
  geom_vline(xintercept = lower_ci, size = 0.7, color = 'blue')  +
  labs(y = "count", x = "Price: 2 Guests for 4 Nights", 
       title = 'Estimate and CI for the Price of a Stay in a Barcelona AirBnB',
       subtitle = 'Histogram of the Estimated Price') +
  scale_x_continuous()+
  theme_bw()+
  NULL

6.4 Recommendations and Next Steps

After reaching the conclusion of our case-study, we recommend the couple to travel to Barcelona and select a property that has at least a 4.5* rating, has more than 10 reviews in total, and is an entire serviced apartment. These are the best predictors of price and will give the most satisfying experience to the couple.

For future analysis, we would recommend introducing a following variable that we would call “proximity_to_center”. Based on our Heat and Cluster Map, we see that the closer the property is located to the center of Barcelona, the more expensive the 4 night rate is. We can also see on the cluster map, that the more central the location is the more properties there are available. Creating this variable could be done through coming up a new variable using the combination of longitude and latitude points. I believe that this would have significant explanatory power in predicting price.

We also have to be mindful, that the Covid-19 Pandemic distorted the data-set a lot, so in the future, we would want to exclude those datapoints from the set in order to exclude outliers and bias from our data set. To eliminate even more bias from the data-set it is imperative to understand that the houseing market, especially the rental one, is cyclical, with a lot of cyclicality, such as summer and christmas season and over the weekends and public holidays. This could be mitigate by introducing variables that account for cyclicality, or taking long-term period averages that could smooth out high and low periods in the market.

Finally, this is publicly sourced data, that is on the the market uploaded by all kinds of individuals, with a lot of oversight, but fake, scam listings can always make it through the filters. We do not know how these points are treated, nor whether they are removed from the dataset after they have been discovered. This leads to possibly false data, that could further bias our predictions.

7 Acknowledgements